Abstract

Neutral macroevolutionary models, such as the Yule model, give rise to a probability 
distribution on the set of discrete rooted binary trees over a given leaf set. Such models 
can provide a signal as to the approximate location of the root when only the unrooted 
phylogenetic tree is known, and this signal becomes relatively more significant as the 
number of leaves grows. In this short note, we show that among models that treat all 
taxa equally, and are sampling consistent (i.e. the distribution on trees is not affected by 
taxa yet to be included), all such models, except one, convey some information as to the 
location of the ancestral root in an unrooted tree. 
1. Introduction

Random neutral models of speciation (and extinction) have been a central tool for
studying macroevolution, since the pioneering work of G.U. Yule in the 1920s (Yule,
1925). Such models typically provide a probability distribution on rooted binary trees
for which the leaf set comprises some given subset of present-day taxa. The 'shape' of
these trees has been investigated in various phylogenetic studies (see, for example, Blum
and Francois (2006)) as it reflects properties of the underlying processes of speciation and
extinction. Ignoring the branch lengths and considering just the topology of the trees
provides not only a more tractable analysis, it also allows for a fortuitous robustness:
several different processes (e.g. time- or density-dependent speciation and extinction
rates) lead to the same probability distribution on discrete topologies, even though the
processes are quite different when branch lengths are considered (Aldous, 1995).



In this short paper, we are concerned with the extent to which phylogenetic models for
tree topology convey information as to where the tree is rooted. This is motivated in part
by the fact that such models are used as priors in phylogenetic analysis (Jones (2011),
Velasco (2008)) and that sequence data analysed assuming the usual time-reversible Markov
processes typically return an unrooted tree (i.e. the location of the root is unknown).
Many techniques attempt to estimate the root of the tree using the data and additional
assumptions (e.g. a molecular clock or the inclusion of an additional taxon that is known
to be an 'outgroup'), or using properties of the tree that depend on branch length (e.g.
'midpoint rooting', where the tree is rooted in the middle of the longest path between
any two leaves) (Boykin et al., 2010).
Here, we are interested in a much more basic question: what (if any) information the
prior distribution on the topology alone might carry as to the location of the root
of the tree. While it has been known that some models convey root-location information,
we show here a stronger result: all models that satisfy a natural requirement (sampling
consistency) have preferred root locations for a tree, except for one very special model.
We begin by recalling some phylogenetic terminology. 

1.1. Definitions and notation 

Let $R(n)$ denote the finite set of rooted binary phylogenetic trees on the leaf set
$[n] = \{1, 2, \ldots, n\}$ for $n \geq 2$; this set has size
$(2n-3)!! = 1 \times 3 \times 5 \times \cdots \times (2n-3)$ (see,
for example, Semple and Steel (2003)).

Given a tree $T \in R(n)$, let $T^{-\rho}$ denote the unrooted binary phylogenetic tree obtained
by suppressing the root vertex $\rho$. This many-to-one association $T \mapsto T^{-\rho}$ is indicated
in Fig. 1, where we have used * to indicate the edge of $T^{-\rho}$ on which the root vertex $\rho$
would be inserted in order to recover $T$. Notice that placing $\rho$ at the midpoint of any
edge of the tree on the right in Fig. 1 leads to a different rooted tree.

Thus, since an unrooted binary tree with $n$ leaves has $2n-3$ edges, if $B(n)$ denotes
the set of unrooted binary phylogenetic trees on leaf set $[n]$ then $|R(n)| = (2n-3)|B(n)|$.
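As a quick sanity check on these counts, the following Python sketch evaluates $|R(n)| = (2n-3)!!$ and confirms the relation $|R(n)| = (2n-3)|B(n)|$ for small $n$. The function names are illustrative only; the count $|B(n)| = (2n-5)!!$ is the one recalled later in the Appendix, and the convention that $k!! = 1$ for $k \leq 0$ is an assumption made here to cover $n = 2$.

```python
def double_factorial(k):
    """Return k * (k-2) * (k-4) * ... down to 1 (taken to be 1 when k <= 1)."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result


def num_rooted(n):
    """|R(n)| = (2n-3)!!, the number of rooted binary phylogenetic trees on [n]."""
    return double_factorial(2 * n - 3)


def num_unrooted(n):
    """|B(n)| = (2n-5)!!, the number of unrooted binary phylogenetic trees on [n]."""
    return double_factorial(2 * n - 5)


# Each unrooted tree has 2n-3 edges, and each edge gives a distinct rooting.
for n in range(2, 10):
    assert num_rooted(n) == (2 * n - 3) * num_unrooted(n)

print(num_rooted(5), num_unrooted(5))  # 105 15
```

For $n = 5$ this gives the 105 rooted trees and 15 unrooted trees, each with seven edges, matching the example in Fig. 1.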






Figure 1: One of the 105 trees in $R(5)$ and the associated unrooted tree $T^{-\rho}$ in $B(5)$ obtained from $T$
by identifying the two edges incident with $\rho$ as a single edge (indicated by *). Each of the seven edges
of $T^{-\rho}$ corresponds to a different rooted tree.



Given a subset $Y$ of $[n]$, and a (rooted or unrooted) binary phylogenetic tree $T$ with
leaf set $[n]$, let $T|Y$ denote the induced binary tree (rooted or unrooted, respectively) that
connects the leaves in $Y$ (for further details, see Semple and Steel (2003)).

Now suppose we have some random process for generating a rooted binary tree on
leaf set $[n]$. We will denote the resulting randomly-generated tree as $\mathcal{T}_n$. Thus, $\mathcal{T}_n$ is an
element of $R(n)$ while $\mathcal{T}_n^{-\rho}$ is an element of $B(n)$. To distinguish more clearly between
rooted and unrooted trees, we will often write $T_U$ to make it clear that we are referring
to an unrooted tree.

1.2. Properties of neutral phylogenetic models 



A well-studied probability distribution on $R(n)$ is the Yule-Harding model (Harding,
1971; Yule, 1925), which can be described recursively as follows. Start with a tree with
two (unlabelled) leaves, and repeatedly apply the following construction: For the tree
constructed thus far, select a leaf uniformly at random and attach a new leaf to the edge
incident with this leaf (by a new edge) and continue until the tree has $n$ leaves. This
produces a random rooted binary tree with $n$ unlabelled leaves (sometimes referred to as
a 'tree shape'). We then generate an element of $R(n)$ by assigning the elements of $[n]$
randomly to the leaves of this tree.
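The recursive construction just described can be sketched in Python as follows. This is a minimal illustration, not code from the paper: the nested-tuple tree representation, the function names and the optional `rng` argument are all assumptions made for the example. Attaching a new leaf to the edge incident with a chosen leaf is implemented as replacing that leaf by a two-leaf subtree (a 'cherry'), which yields the same topology.

```python
import random


def yule_harding_tree(n, rng=None):
    """Sample a rooted binary tree on leaf set {1, ..., n} as nested 2-tuples."""
    if n < 2:
        raise ValueError("need at least two leaves")
    rng = rng or random.Random()
    children = {0: (1, 2)}  # internal node id -> (child id, child id)
    leaves = [1, 2]         # ids of the current (unlabelled) leaves
    next_id = 3
    while len(leaves) < n:
        i = rng.randrange(len(leaves))  # choose a leaf uniformly at random
        old = leaves[i]
        # Splitting the pendant edge above `old` turns `old` into a cherry.
        children[old] = (next_id, next_id + 1)
        leaves[i] = next_id
        leaves.append(next_id + 1)
        next_id += 2
    # Assign the labels 1..n to the leaves uniformly at random.
    labels = list(range(1, n + 1))
    rng.shuffle(labels)
    label_of = dict(zip(leaves, labels))

    def build(v):
        if v in children:
            return tuple(build(c) for c in children[v])
        return label_of[v]

    return build(0)


def leaf_labels(tree):
    """List the leaf labels of a nested-tuple tree."""
    if isinstance(tree, int):
        return [tree]
    return [x for sub in tree for x in leaf_labels(sub)]


tree = yule_harding_tree(6, random.Random(0))
print(tree, sorted(leaf_labels(tree)))
```

The recursive `build` step is adequate for moderate $n$; very large trees would need an iterative traversal to stay within Python's recursion limit.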

This probability distribution arises under quite general conditions, provided an
exchangeability assumption is made. We describe this briefly here (for more detail, see
Aldous (1995)). Consider any model of speciation and extinction in which the rates of
these two events can either be constant or vary arbitrarily with time (and even depend
on the past or on the number of lineages present). Then provided that each event
(speciation or extinction) is equally likely to affect any one of the extant lineages at any given
time, the resulting probability distribution on $R(n)$ is that described by the Yule-Harding
model. Moreover, this model also provides an equivalent distribution on tree topologies
to that given by the Kingman coalescent model from population genetics, when, once
again, branch lengths are ignored (Aldous, 1995; Zhu et al., 2011).



A feature of the Yule-Harding model is that given just the associated unrooted tree
$T^{-\rho}$, one can readily calculate the maximum likelihood (ML) estimate of the edge(s) of
$T^{-\rho}$ on which the root node $\rho$ was located (such ML edges are always incident with one
of the (at most two) centroid vertices of $T^{-\rho}$); moreover, the probability that an ML edge
contains the root node tends to a non-zero constant ($4\log(4/3) - 1 \approx 0.15$) as $n \to \infty$
(Steel and McKenzie, 2001). Indeed, if we consider the edges within three edges of this
ML edge, the probability that at least one of them contains the root node is close to 0.9
(Steel and McKenzie, 2001). Thus, on a very large unrooted tree, $T_U$, we can isolate the
likely location of the root to a relatively small proportion of edges (edges that are incident
with, or near to, the centroid vertex of $T_U$). Similar results concerning the initial (root)
vertex for a quite different model of tree growth were derived by Haigh (1970).



On the other hand, a model which provides no hint as to which edge in the associated
unrooted tree might have contained the root is the PDA model (for 'proportional to
distinguishable arrangements'), which is simply the uniform probability distribution on
$R(n)$. This model is not directly described by a model of macroevolution involving
speciation and/or extinction, in the same way that the Yule-Harding distribution is, although
it is possible to derive the PDA distribution under somewhat contrived evolutionary
assumptions. These include conditioning on events such as a short window of opportunity
for speciation and the survival of the tree to produce $n$ leaves (Steel and McKenzie, 2001);
conditioning on a critical binary branching process producing $n$ leaves (Aldous, 1995); or
general conditional independence assumptions regarding parent and daughter branches
in a tree (Pinelis, 2003).



Notice that both the PDA and Yule-Harding models comprise a family of probability
distributions (i.e. they provide a probability distribution on $R(n)$ for each $n \geq 2$).
Thus we will refer to any such sequence $(p_n : n \geq 2)$ of probability distributions on
$(R(n) : n \geq 2)$ as a phylogenetic model $\mathcal{M}$ for $(R(n) : n \geq 2)$, and we will also write
$P(\mathcal{T}_n = T)$ for the probability $p_n(T)$ when $T \in R(n)$.

We now list three properties that any family of probability distributions on
$(R(n) : n \geq 2)$ can possess. The first two (the exchangeability property (EP) and sampling
consistency (SC)) are satisfied by several models (including the PDA and Yule-Harding
models), while the third, root invariance (RI), does not hold for the Yule-Harding model,
as we saw above, although it does hold for the PDA model.



EP [Exchangeability property] For each $n \geq 2$, if $T$ and $T'$ are trees in $R(n)$ and $T'$ is
obtained from $T$ by permuting its leaf labels, then $p_n(T') = p_n(T)$.

SC [Sampling consistency] For each $n \geq 2$, and each tree $T$ in $R(n)$ we have:

$$P(\mathcal{T}_{n+1}|[n] = T) = p_n(T).$$

RI [Root invariance] For each $n \geq 2$, if $T$ and $T'$ are trees in $R(n)$ and they are
equivalent up to the placement of their roots (i.e. $T^{-\rho} = {T'}^{-\rho}$) then $p_n(T') = p_n(T)$.



The exchangeability property (EP), from Aldous (1995), requires the probability of
a particular phylogenetic tree to depend just on its shape and not on how its leaves are
labelled (this property is called 'label-invariance' in Steel and Penny (1993)).

The sampling consistency (SC) property, also from Aldous (1995), states that the
probability distribution on trees for a given set of taxa does not change if we add another
taxon (namely, $n+1$) and consider the induced marginal distribution on the original set
of taxa. The condition seems reasonable if we wish the model to be 'stable' in the sense
that, in the absence of any data associated with the taxa, $p_n$ should depend just on the
taxa present, and not on taxa yet to be discovered (or not included in the set of taxa
under study).¹
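To make the SC condition concrete, the following Python sketch checks it by brute force for the uniform (PDA) distribution on small leaf sets. It is an illustration under stated assumptions rather than anything from the paper: trees are represented as nested `frozenset` objects (an unordered pair of subtrees, with integers as leaves), and the helper names `insert_leaf`, `rooted_trees`, `restrict` and `pda_is_sampling_consistent` are hypothetical. The set $R(n+1)$ is enumerated by attaching leaf $n+1$ to every edge (including a root edge) of every tree in $R(n)$.

```python
from collections import Counter
from fractions import Fraction


def insert_leaf(tree, leaf):
    """Yield every tree obtained by attaching `leaf` above some vertex of `tree`."""
    yield frozenset([tree, leaf])  # attach on the edge above this subtree
    if isinstance(tree, frozenset):
        a, b = tuple(tree)
        for a2 in insert_leaf(a, leaf):
            yield frozenset([a2, b])
        for b2 in insert_leaf(b, leaf):
            yield frozenset([a, b2])


def rooted_trees(n):
    """Enumerate all rooted binary phylogenetic trees on leaf set {1, ..., n}."""
    trees = [1]
    for leaf in range(2, n + 1):
        trees = [t2 for t in trees for t2 in insert_leaf(t, leaf)]
    return trees


def restrict(tree, keep):
    """Return the rooted tree induced on the leaves in `keep` (None if empty)."""
    if isinstance(tree, int):
        return tree if tree in keep else None
    parts = [r for r in (restrict(c, keep) for c in tree) if r is not None]
    if not parts:
        return None
    return parts[0] if len(parts) == 1 else frozenset(parts)


def pda_is_sampling_consistent(n):
    """Check P(T_{n+1}|[n] = T) = p_n(T) for every T when p is uniform."""
    bigger = rooted_trees(n + 1)
    smaller = rooted_trees(n)
    keep = set(range(1, n + 1))
    induced = Counter(restrict(t, keep) for t in bigger)
    p_next = Fraction(1, len(bigger))
    p_n = Fraction(1, len(smaller))
    return all(induced[t] * p_next == p_n for t in smaller)


print([pda_is_sampling_consistent(n) for n in range(2, 6)])
```

Each tree in $R(n)$ arises as the restriction of exactly $2n-1$ trees in $R(n+1)$, which is why the uniform distribution passes this check.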

The root-invariance (RI) property states that the model does not prefer any particular
rooting of a tree (i.e. any re-rooting of the tree would have equal probability).

Note that the three properties EP, SC and RI pertain to distributions on trees in 
which the taxa have yet to have any biological data associated with them (i.e. they are 
'prior' to considering any particular data). If the taxa come with data that is used to 
construct a probability distribution on trees, then clearly EP will often not hold, and 
SC could also fail, since data provided by an additional taxon can influence the relative 
support for trees on an existing set of taxa. RI may or may not fail, depending on the 
assumptions of the model (e.g. whether or not it is time-reversible, or whether or not a 
molecular clock is imposed). 

Several families of probability distributions on $(R(n) : n \geq 2)$ satisfy both EP and SC,
including the '$\beta$-splitting model', a one-parameter family that was described by Aldous
(1995) and which includes, as special cases, the Yule-Harding model (when $\beta = 0$) and
the PDA model (when $\beta = -3/2$). Other phylogenetic models satisfying SC have also
been studied recently in Jones (2011). Models satisfying EP and SC have been studied
(and characterised) recently by Haas et al. (2008) and McCullagh et al. (2008).
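The two special cases of the $\beta$-splitting family can be illustrated numerically. The sketch below is hedged: the closed form used for the root-split probabilities, $q_n(i) \propto \Gamma(\beta+i+1)\Gamma(\beta+n-i+1) / (\Gamma(i+1)\Gamma(n-i+1))$ for $1 \leq i \leq n-1$, is recalled from Aldous's description of the model and is not stated in this note, so it should be treated as an assumption. The function names are illustrative. The PDA comparison is obtained by direct counting with $|R(k)| = (2k-3)!!$.

```python
from math import comb, gamma


def double_factorial(k):
    """Return k * (k-2) * ... down to 1 (taken to be 1 when k <= 1)."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result


def beta_split_distribution(n, beta):
    """Probabilities that the root splits n leaves into sizes (i, n-i), i = 1..n-1.

    Uses the assumed beta-splitting weights described in the lead-in (beta > -2).
    """
    weights = [
        gamma(beta + i + 1) * gamma(beta + n - i + 1)
        / (gamma(i + 1) * gamma(n - i + 1))
        for i in range(1, n)
    ]
    total = sum(weights)
    return [w / total for w in weights]


def pda_split_distribution(n):
    """Root-split size distribution under the uniform distribution on R(n)."""
    counts = [
        comb(n, i) * double_factorial(2 * i - 3) * double_factorial(2 * (n - i) - 3)
        for i in range(1, n)
    ]
    total = sum(counts)
    return [c / total for c in counts]


n = 7
yule = beta_split_distribution(n, 0.0)    # uniform over split sizes
pda = beta_split_distribution(n, -1.5)    # should match direct PDA counting
print(yule)
print(max(abs(a - b) for a, b in zip(pda, pda_split_distribution(n))))
```

With $\beta = 0$ every split size is equally likely, which is the familiar behaviour of the Yule-Harding model, while $\beta = -3/2$ reproduces the split sizes obtained by counting trees uniformly.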

If we consider the combination of EP and RI, several possible distributions on $R(n)$
satisfy these two properties, since one may simply define any probability distribution on
unrooted binary tree shapes, and extend this to labelled and rooted trees by imposing
RI and EP.

Finally, consider the combination of SC and RI. The PDA distribution satisfies these 
two properties, and, as noted already, it also satisfies EP. The point of this short note 
is to show that, unlike the other two combinations of properties, apart from the PDA 
model, there is no other phylogenetic model that satisfies the combination SC and RI. 
This 'impossibility' result is of similar spirit to (but is quite unrelated to) the result of
Velasco (2007) concerning phylogenetic models that are uniform on clades of all given
sizes.

¹Note that in SC, the probability $P(\mathcal{T}_{n+1}|[n] = T)$ is simply the sum of $p_{n+1}(T')$ over all
$T' \in R(n+1)$ for which $T'|[n] = T$, so SC is a linear constraint that applies between the
$p_n$ and $p_{n+1}$ values.



2. Results 



We now state the main result of this short note, the proof of which is given in the 
Appendix. 

Theorem 1. A phylogenetic model $\mathcal{M} = (p_n : n \geq 2)$ for $(R(n) : n \geq 2)$ satisfies the
two properties SC and RI if and only if $\mathcal{M}$ is the PDA model.

The relevance of this theorem is that the PDA model does not describe the shape of
most published phylogenetic trees derived from biological data very well, as the latter
trees are typically more balanced than the PDA model predicts (Aldous, 1995; Aldous
et al., 2011; Blum and Francois, 2006). Moreover, as noted already, the PDA model does
not have a compelling biological motivation. Thus, the significance of Theorem 1 is that
any 'biologically realistic' sampling consistent distribution on discrete rooted phylogenetic
trees necessarily favours some root locations over others in the associated unrooted tree
topology. And this holds without knowledge of the branch lengths or, indeed, of any
data.

As a corollary of Theorem 1, the only value of $\beta$ for which the $\beta$-splitting model
satisfies sampling consistency and root invariance is $\beta = -3/2$, which corresponds to the
PDA model. For other values of $\beta$, it may be of interest to determine how accurately one
can estimate the exact (or approximate) location of the edge of an unrooted tree that
contained the root node, when the rooted tree evolved under the $\beta$-splitting model.

Finally, we note that Theorem 1 does not require EP to hold at any point; however,
it falls out as a second consequence of the theorem that any phylogenetic model that
satisfies SC and RI must also satisfy EP (since this holds for the PDA model). The conditions
EP and SC also apply to phylogenetic models for unrooted trees (simply replace $R(n)$
with $B(n)$ in their definition) and the reader may wonder whether Theorem 1 is merely
a consequence of a result that states that any phylogenetic model on unrooted trees that
satisfies SC is uniform. Such a result, if true, would indeed imply Theorem 1, but such a
result does not hold, even if we append condition EP to SC; a simple counterexample is
the probability distribution on unrooted trees induced by the Yule-Harding model.
3. Appendix: Proof of Theorem 1

Since the PDA model clearly satisfies RI and SC, it remains to establish the 'only
if' claim. To this end, we first establish the following: For any phylogenetic model
$\mathcal{M} = (p_n : n \geq 2)$ for $(R(n) : n \geq 2)$ that satisfies SC, and any $T_U \in B(n)$, the following
identity holds:

$$P(\mathcal{T}_{n+1}^{-\rho}|[n] = T_U) = P(\mathcal{T}_n^{-\rho} = T_U). \tag{1}$$

To establish (1), first observe that for any tree $T_{n+1}$ in $R(n+1)$, one has
$T_{n+1}^{-\rho}|[n] = (T_{n+1}|[n])^{-\rho}$. Thus,

$$P(\mathcal{T}_{n+1}^{-\rho}|[n] = T_U) = P((\mathcal{T}_{n+1}|[n])^{-\rho} = T_U) = \sum_{T \in R(n):\, T^{-\rho} = T_U} P(\mathcal{T}_{n+1}|[n] = T),$$

and, by SC, we can express this as:

$$\sum_{T \in R(n):\, T^{-\rho} = T_U} P(\mathcal{T}_n = T) = P(\mathcal{T}_n^{-\rho} = T_U),$$

which establishes the claimed identity (1).

Returning to the proof of Theorem 1, observe that, by RI, we can describe the
probability distribution $p_n$ as equivalent to the one in which we first generate an
unrooted phylogenetic tree $T_U \in B(n)$ with some associated probability $q(T_U)$ and then
select one of the edges of $T_U$ uniformly at random to subdivide by the root vertex. We
will refer to this second (uniform) process as the root-edge selection process. Thus, for
any $T_U \in B(n)$, we have:

$$q(T_U) = P(\mathcal{T}_n^{-\rho} = T_U), \tag{2}$$

and, for any $T \in R(n)$, we have, from RI, that:

$$P(\mathcal{T}_n = T) = \frac{1}{2n-3}\, q(T^{-\rho}). \tag{3}$$

Let $b(n) = |B(n)|$ (i.e. $b(n) = (2n-5)!! = |R(n-1)|$). We will show by induction
on $n$ that $q$ is uniform on $B(n)$ for all $n \geq 2$ (i.e. $q(T_U) = \frac{1}{b(n)}$ for all $T_U \in B(n)$ and all
$n \geq 2$). This induction hypothesis holds for $n = 2$, so supposing that it holds for some $n \geq 2$,
we will use this to show that $q$ is uniform on $B(n+1)$. First observe that, for $T_U \in B(n)$,
we can apply identity (1), since SC holds, to deduce that:

$$P(\mathcal{T}_{n+1}^{-\rho}|[n] = T_U) = q(T_U) = \frac{1}{b(n)}, \tag{4}$$

where the second equality holds by the induction hypothesis.

Now, for each edge $e$ of $T_U$ consider the tree $T_U^e \in B(n+1)$ obtained from $T_U$ by
attaching leaf $n+1$ to the midpoint of edge $e$ with a new edge. Let $p(e) = P(A_e|B_{T_U})$
where $A_e$, $B_{T_U}$ denote the nested events defined by:

$$A_e := \{\mathcal{T}_{n+1}^{-\rho} = T_U^e\} \quad \text{and} \quad B_{T_U} := \{\mathcal{T}_{n+1}^{-\rho}|[n] = T_U\}.$$



Thus $p$ is a probability distribution on the edges of $T_U$ and our aim is to show that $p$
is uniform. To this end, recall that on $T_U^e$ each edge has a uniform root-edge selection
probability by RI. This implies (by SC) that the root-edge selection process on $T_U$ will
select $e$ with the following probability:

$$p(e) \cdot \frac{3}{2n-1} + (1 - p(e)) \cdot \frac{1}{2n-1}. \tag{5}$$

In this expression, the term $2n-1$ in the denominator is simply the number of edges of
$T_U^e$, and the numerator term 3 corresponds to the three edges of $T_U^e$ consisting of the edge
incident with $n+1$ and the two edges incident with that edge. Since the root-edge
selection process on the edges of $T_U$ is also uniform, and this tree has $2n-3$ edges, it
follows from the expression in (5) that:

$$p(e) \cdot \frac{3}{2n-1} + (1 - p(e)) \cdot \frac{1}{2n-1} = \frac{1}{2n-3}.$$

Thus, $p(e) = \frac{1}{2n-3}$, and so $p$ is the uniform distribution on the edges of $T_U$, as claimed.
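The last algebraic step can be checked with exact rational arithmetic. The following Python sketch, with the illustrative function name `solve_root_edge_probability`, solves the linear equation $p \cdot \frac{3}{2n-1} + (1-p) \cdot \frac{1}{2n-1} = \frac{1}{2n-3}$ for $p$ and confirms that the solution is $\frac{1}{2n-3}$ for a range of $n$.

```python
from fractions import Fraction


def solve_root_edge_probability(n):
    """Solve p*3/(2n-1) + (1-p)*1/(2n-1) = 1/(2n-3) exactly for p."""
    # The left-hand side simplifies to (1 + 2p)/(2n-1), which is linear in p.
    target = Fraction(1, 2 * n - 3)
    return (target * (2 * n - 1) - 1) / 2


for n in range(2, 12):
    p = solve_root_edge_probability(n)
    # Substitute back into the original equation as a consistency check.
    lhs = p * Fraction(3, 2 * n - 1) + (1 - p) * Fraction(1, 2 * n - 1)
    assert lhs == Fraction(1, 2 * n - 3)
    assert p == Fraction(1, 2 * n - 3)

print(solve_root_edge_probability(5))  # 1/7
```

For $n = 5$, for instance, the unrooted tree has seven edges and each receives probability $1/7$.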
Finally, observe that the uniformity of $p$ now entails that $q$ is uniform on $B(n+1)$
since for any tree $T_U' \in B(n+1)$, we can select $T_U \in B(n)$ and an edge $e$ of $T_U$ for which
$T_U^e = T_U'$, and then:

$$q(T_U') = q(T_U^e) = P(\mathcal{T}_{n+1}^{-\rho} = T_U^e) = P(A_e) = P(A_e \,\&\, B_{T_U}) = P(A_e|B_{T_U})\,P(B_{T_U}),$$

and thus:

$$q(T_U') = \frac{1}{2n-3} \cdot \frac{1}{b(n)} = \frac{1}{b(n+1)}.$$

Thus, we have established the induction step required to show that $q$ is uniform on
$B(n)$ for all $n \geq 2$. It now follows from Eqn. (3) that $p_n$ is uniform on $R(n)$, for all
$n \geq 2$, which completes the proof of Theorem 1. □



